Application direct access to SATA drive

ABSTRACT

A networked database management system (DBMS) and supporting infrastructure is disclosed. At least one application in the disclosed DBMS can directly access a pinned RDMA buffer for network reads. In addition, an application can directly access a pinned DMA buffer for drive reads. The nodes of the DBMS are arranged in a particular configuration to aid high speed access. In addition, all data is stored in register width fields, or integer multiples thereof. Finally, at least one application in the disclosed DBMS includes a drive access class. The drive access class includes a NVME drive access subclass and a SATA drive access subclass. The NVME drive access subclass allows the application to directly access NVME drives without making an operating system call, while the SATA drive access subclass allows the application to directly access SATA drives without making an operating system call.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit and priority of U.S. Patent Application No. 62/403,328, entitled “APPLICATION DIRECT ACCESS TO NETWORK RDMA MEMORY,” filed Oct. 3, 2016, which is hereby incorporated by reference in its entirety. This application also claims the benefit and priority of U.S. Patent Application No. 62/403,231, entitled “HIGHLY PARALLEL DATABASE MANAGEMENT SYSTEM,” filed Oct. 3, 2016, which is hereby incorporated by reference in its entirety.

FIELD OF THE DISCLOSURE

The present disclosure generally relates to networked database management systems (DBMS) and supporting infrastructure. More particularly, the present disclosure relates to computer software application access to resources, such as memory and disk. More particularly still, the present disclosure relates to efficient access by a database application to memory and storage and a database network topology to support the same. Even more particularly still, the present disclosure relates to efficiently aligned direct access to network and disk structures by a database application, for example, accessing NVME (Non-Volatile Memory Express) or SATA (Serial Advanced Technology Attachment) drives directly through a kernel bypass, as well as a computer software architecture to accomplish the same.

BACKGROUND

A DBMS is a suite of computer programs that are designed to manage a database, which is a large set of structured data. In particular, a DBMS is designed to quickly access and analyze large amounts of stored data. Most modern DBMS systems comprise multiple computers (nodes). The nodes generally communicate via a network, which will use a network protocol, such as HTTP, or raw TCP/IP. Information is exchanged between nodes in packets, the specific format of which will be determined by the specific protocol used by the network. The data wrapped in a packet will generally be compressed to the greatest extent possible to preserve network bandwidth. Accordingly, when it has been received, it will have to be formatted for use by the receiving node. A variety of DBMSs and the underlying infrastructure to support them are well known in the art. Database input/output (“I/O”) systems comprise processes and threads that identify, read, and write blocks of data from storage; e.g., spinning magnetic disk drives, network storage, FLASH drives, or cloud storage.

Like many software systems, DBMSs evolved from standalone computers, to sophisticated client/server setups, to cloud systems. An example of a cloud based DBMS is depicted in FIG. 1. In particular, a cloud system 2 will generally comprise a variety of nodes (computers) as well as software that operates on the nodes. The cloud system 2 will comprise numerous separate nodes, including multiple database servers 1. Each database server will maintain separate storage (not depicted), which will store some part of the maintained data. Various clients can access the cloud system 2 through the Internet 4. Clients can include, for example, a standard desktop or laptop computer 6, a mobile device 7, as well as various sensors 8 and control equipment 9.

Generally, DBMSs operate on computer systems (whether standalone, client/server, or cloud) that incorporate operating systems. Operating systems, which are usually designed to work across a wide variety of hardware, utilize device drivers to abstract the particular functions of hardware components, such as, for example, disk controllers and network interface cards. As drivers are generally accessed through an operating system, such accesses will typically entail significant resource overhead, such as a mode switch, i.e., a switch from executing application logic to operating system logic, or a context switch, i.e., the pausing of one task to perform another. Such switches are typically time consuming, sometimes on the order of milliseconds of processor time.

Data stored in a DBMS is usually stored redundantly, using, for example, a RAID controller, Storage Area Network (“SAN”) system, or dispersed data storage. In addition, other measures to ensure that data is stored correctly are usually taken as well. For example, many DBMSs utilize a write log. A write log, which is generally written before the actual database is updated, contains a record of all changes to the database, so that a change can be easily backed out, or, in case of a transaction processing failure, can be redone as needed. In addition, writing to the log prior to committing the data guarantees that committed transactions can be preserved; i.e., properly written to disk. Using prior art methods, disk commits are lengthy procedures, often taking milliseconds. In addition, prior art systems utilizing traditional disk systems must write a complete block, and disk log records will rarely occupy a multiple of a block of data. Given that writing a block of data can be lengthy, prior art database systems generally buffer the log and write it to disk only periodically. Otherwise, if the log were written after each modification, the DBMS would be severely limited in transaction processing speed.

The process of buffering log writes, sometimes known as “boxcarring,” can reduce the number of transactions that a system must track and commit. However, there are penalties in user response time, lock contention, and memory usage. In addition, boxcarring can complicate system recovery.

In certain cases, a DBMS can use Remote Direct Memory Access (RDMA) to transfer data between two nodes (computers) of the DBMS. RDMA is a technology that allows a network interface card (NIC) to transfer data directly to or from memory of a remote node without occupying the central processing unit (CPU) of either node. It should be noted that the term network interface card, as used herein, includes all network interfaces, including chips and units built directly into processors. By way of example, a remote client can register a target memory buffer and send a description of the registered memory buffer to the storage server. The remote client then issues a read or write request to the storage server. If the request is a write request, the storage server performs an RDMA read to load data from the target memory buffer into the storage server's local memory. The storage server then causes a disk controller to write the target data to a storage disk and, once the write is complete, generates and sends a write confirmation message to the remote client. On the other hand, if the request is a read request, the storage server uses the disk controller to perform a block-level read from disk and loads the data into its local memory. The storage server then performs an RDMA write to place the data directly into an application memory buffer of the remote computer. After an RDMA operation completes, the remote client deregisters the target memory buffer from the RDMA network to prevent further RDMA accesses. Using RDMA increases data throughput, decreases the latency of data transfers, and reduces load on the storage server and remote client's CPU during data transfers. Examples of RDMA capable networks include Infiniband, iWarp, RoCE (RDMA over Converged Ethernet), and OmniPath.
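By way of illustration only, the following minimal sketch registers and deregisters a pinned target memory buffer using the libibverbs API, matching the client-side registration step described above. Device selection, buffer size, and error handling are simplified assumptions, and the queue pair setup needed for the RDMA operations themselves is omitted; nothing here should be read as the specific implementation of the disclosed system.

    // Minimal sketch (C++): pinning and registering a buffer for RDMA
    // with libibverbs. Queue pair setup and RDMA read/write are omitted.
    #include <infiniband/verbs.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        int num_devices = 0;
        ibv_device** devices = ibv_get_device_list(&num_devices);
        if (!devices || num_devices == 0) return 1;

        ibv_context* ctx = ibv_open_device(devices[0]);
        ibv_pd* pd = ibv_alloc_pd(ctx);

        // Allocate a page-aligned buffer and register (pin) it. The access
        // flags permit a remote peer to target it with RDMA reads/writes.
        const size_t len = 4096;  // assumed buffer size
        void* buf = std::aligned_alloc(4096, len);
        ibv_mr* mr = ibv_reg_mr(pd, buf, len,
                                IBV_ACCESS_LOCAL_WRITE |
                                IBV_ACCESS_REMOTE_READ |
                                IBV_ACCESS_REMOTE_WRITE);

        // The "description of the registered memory buffer" sent to the
        // storage server consists essentially of these three values.
        std::printf("addr=%p rkey=%u len=%zu\n",
                    buf, static_cast<unsigned>(mr->rkey), len);

        ibv_dereg_mr(mr);  // deregister to prevent further RDMA access
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devices);
        std::free(buf);
        return 0;
    }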

FIG. 2 depicts a block representation of a conventional system in which data is copied from a pinned buffer in a first node to a second node. The first node 10 includes a pinned buffer 20, a driver 30, and a NIC 40. The pinned buffer 20 and the driver 30 are each operatively coupled to the NIC 40. The second node 50 includes a pinned buffer 60, a driver 70, and a NIC 80. The pinned buffer 60 and the driver 70 are coupled to the NIC 80. The NIC 40 is operatively coupled to the NIC 80 via a network 90. The drivers 30, 70 are accessible by an operating system running on the respective nodes 10, 50.

In operation, the driver 30 in the node 10 writes a descriptor for a location of the pinned buffer 20 to the NIC 40. The driver 70 in the node 50 writes a descriptor for the location of the pinned buffer 60 to the NIC 80. The driver 30 works with the operating system, as well as other software and hardware on the node 10, to guarantee that the buffer 20 is locked into physical memory; i.e., “pinned.” The NIC 40 reads data from the pinned buffer 20 and sends the read data on the network 90. The network 90 passes the data to the NIC 80 of the node 50. The NIC 80 writes the data to the pinned buffer 60.

FIG. 3 is a sequence of steps performed by a prior art network RDMA system to receive data over a network. In the first step 91, a NIC 40 allocates a pinned memory buffer 20. In step 92, data is received from the network (not shown) by the NIC 40 into the pinned memory buffer 20. In step 93, a driver notifies an application 35 that data has been received. In step 94, the application 35 calls into the driver 30 (usually through an operating system, which is not depicted) to read the data in the pinned buffer 20. In step 95, the driver 30 copies the data in the pinned buffer 20 to a separate memory buffer (not depicted) allocated by the application 35.

Serial Advanced Technology Attachment (SATA) is a computer bus interface that connects host bus adapters to mass storage devices such as hard drives, optical drives, and solid state drives (SSDs). While SATA works with SSDs, it is not designed to allow for the significant level of parallelism that SSDs are capable of. NVMe, NVM (Non-Volatile Memory) Express, or Non-Volatile Memory Host Controller Interface Specification (NVMHCI) is a logical device interface specification for accessing non-volatile storage media attached via a PCI Express (PCIe) bus. NVMe allows the parallelism of modern Solid State Drives (SSDs) to be effectively utilized by node hardware and software.

Microprocessor architectures generally operate on data types of a fixed width, as the registers within the microprocessor will all be of a fixed width. For example, many modern processors include either 32 bit wide or 64 bit wide registers. In order to achieve maximal efficiency, prior to operating on data, the data must be properly aligned in memory along boundaries defined by that fixed width.

Storage and network bandwidth are substantial drivers of cost for DBMS systems. Accordingly, prior art DBMS systems tend to optimize data handling to conserve storage and network bandwidth. For example, a prior art DBMS system may include a data manipulation step that effectively compresses data prior to storing it or transmitting it via the network, and a decompression step when reading data from storage or the network.

OBJECTS OF THE DISCLOSED SYSTEM, METHOD, AND APPARATUS

Accordingly, it is an object of this disclosure to provide an infrastructure for a DBMS, and an apparatus and method, that operate more efficiently than prior art systems.

Another object of the disclosure is to provide an efficient network infrastructure.

Another object of the disclosure is to provide a network infrastructure for a DBMS that allows a database application to directly access pinned RDMA memory.

Another object of the disclosure is to provide a network infrastructure for a DBMS that manages pinned RDMA memory.

Another object of the disclosure is to provide a storage infrastructure for a DBMS that allows a database application to directly access storage buffers used for disk access.

Another object of the disclosure is to provide a storage infrastructure for a DBMS that manages pinned DMA memory.

Another object of the disclosure is to provide a DBMS infrastructure that allows an application to directly access NVME drives.

Another object of the disclosure is to provide a DBMS infrastructure that allows an application to directly access SATA drives.

Another object of the disclosure is to provide a DBMS infrastructure whereby data is stored in a format that is usable by a processor with minimal adjustment.

Another object of the disclosure is to provide a DBMS infrastructure whereby network data is maintained in a format that is usable by a processor with minimal adjustment.

Other advantages of this disclosure will be clear to a person of ordinary skill in the art. It should be understood, however, that a system or method could practice the disclosure while not achieving all of the enumerated advantages, and that the protected disclosure is defined by the claims.

SUMMARY OF THE DISCLOSURE

A networked database management system, along with the supporting infrastructure, is disclosed. The disclosed DBMS is capable of handling enormous amounts of data (an Exabyte or more) and accessing each record within the database frequently. In one embodiment, the disclosed DBMS comprises a first high speed storage cluster including a first plurality of storage nodes. Each storage node includes a server and one or more storage drives at a high performance level. The first high speed storage cluster also includes a first switch. The DBMS also comprises a second high speed storage cluster including a second plurality of storage nodes. Each storage node in this second cluster also includes a server and one or more storage drives, which operate at a lower performance level. The second high speed storage cluster also includes a second switch. The DBMS also comprises an index cluster including a plurality of index nodes and a third switch. The first switch is operatively coupled to the third switch by a high speed RDMA capable link, and the second switch is operatively coupled to the third switch by a high speed RDMA capable link.
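For concreteness, the cluster arrangement summarized above could be modeled as in the sketch below. All type and field names are invented for illustration; the sketch merely restates the topology (two storage tiers plus an index cluster, each behind its own switch, with the storage switches coupled to the index switch by RDMA capable links).

    // Hypothetical, declarative model of the summarized topology (C++).
    #include <string>
    #include <vector>

    enum class Tier { HighPerformance, LowerPerformance };

    struct StorageNode {
        std::string server;
        int drive_count;  // one or more storage drives per node
    };

    struct StorageCluster {
        Tier tier;
        std::vector<StorageNode> nodes;
        int switch_id;  // the cluster's own switch
    };

    struct IndexCluster {
        int index_node_count;
        int switch_id;  // the "third switch"
        // Switches of the storage clusters coupled to this switch by
        // high speed RDMA capable links.
        std::vector<int> rdma_coupled_switches;
    };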

In a separate embodiment, the disclosed DBMS includes an application with a drive access class. The drive access class includes a NVME drive access subclass and a SATA drive access subclass. The NVME drive access subclass allows the application to directly interface with NVME drives, and the SATA drive access subclass allows the application to directly interface with SATA drives. In addition, NVRAM technologies are now viable, and a specific subclass is allocated to optimize access to drives utilizing NVRAM technology. As future storage technologies are introduced, additional optimized storage access subclasses can be developed.

For example, a node in a database management system will include a drive controller, such as a SATA drive controller that communicates with a drive, such as a SATA hard drive or a SATA solid state drive. An application running on the node will utilize a SATA drive access class, or equivalent code abstraction (such as a function set), to access the SATA drive. The application will create a pinned memory buffer, which may in certain circumstances be an RDMA memory buffer. The application can, for example, establish a queue using the pinned memory buffer, i.e., using the pinned memory buffer as the queue elements, with the queue having a plurality of fixed size entries. The application will then directly access the SATA drive using the SATA drive access class, and write data from one of the fixed size entries to the SATA drive, as illustrated in the sketch below.
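A minimal sketch of that write path follows. The class name SataDriveAccess, its write_block method, and the queue dimensions are hypothetical stand-ins for the drive access class and queue described above, and the write body is a stub; a real implementation would program the drive controller directly, without an operating system call.

    // Sketch (C++): a queue of fixed size entries in a (notionally pinned)
    // buffer, with one entry written directly to a SATA drive.
    #include <array>
    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kEntrySize  = 4096;  // fixed size of a queue entry
    constexpr std::size_t kEntryCount = 64;

    // Hypothetical kernel-bypass SATA access abstraction; see the drive
    // access class discussion in the detailed description below.
    class SataDriveAccess {
    public:
        bool write_block(std::uint64_t lba, const std::uint8_t* data,
                         std::size_t len) {
            // Stub: would build and submit the command to the SATA
            // controller directly, with no operating system call.
            (void)lba; (void)data; (void)len;
            return true;
        }
    };

    struct Entry { alignas(64) std::uint8_t bytes[kEntrySize]; };

    int main() {
        // Stand-in for a pinned (possibly RDMA-registered) memory buffer
        // organized as queue elements.
        static std::array<Entry, kEntryCount> queue;
        std::size_t head = 0;

        SataDriveAccess sata;
        queue[head].bytes[0] = 0x42;                        // fill an entry
        sata.write_block(0, queue[head].bytes, kEntrySize); // write it out
        head = (head + 1) % kEntryCount;                    // advance queue
        return 0;
    }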

In a further embodiment, the node will include a network interface controller, which is adapted to receive data into one of the fixed size data entries in the queue. For example, the network interface controller can use RDMA to copy data into the fixed size data entry as explained herein. Further, the application can then write the received data onto a SATA drive (solid state, spinning magnetic, etc.) from the fixed size entry using the SATA drive access class, without making any operating system calls.

In another embodiment, the disclosed DBMS incorporates a node. The node includes a network interface card including a controller. A DBMS application executing on the node coordinates with the controller to allocate a pinned memory buffer, which the DBMS application accesses directly, allowing the DBMS application to directly access and manipulate received network data.

In another embodiment, the disclosed DBMS incorporates a node. The node includes a drive controller. A DBMS application allocates a pinned memory buffer and communicates this buffer to the drive controller, which allows the DBMS application to access and manipulate drive data with minimal overhead.

BRIEF DESCRIPTION OF THE DRAWINGS

Although the characteristic features of this disclosure will be particularly pointed out in the claims, the invention itself, and the manner in which it may be made and used, may be better understood by referring to the following description taken in connection with the accompanying drawings forming a part hereof, wherein like reference numerals refer to like parts throughout the several views and in which:

FIG. 1 depicts a prior art simplified network diagram of a cloud based DBMS system;

FIG. 2 depicts a prior art simplified block diagram of a network based RDMA system;

FIG. 3 depicts a sequence of steps that a prior art network based RDMA system would execute to receive data over a network;

FIG. 4 depicts a simplified network diagram illustrating the disclosed DBMS system and the infrastructure supporting the same;

FIG. 5 depicts a simplified block diagram of a network implementing the disclosed RDMA memory buffer management system;

FIG. 6 depicts a sequence of steps that a node utilizing the disclosed RDMA memory buffer system will execute to receive data over a network;

FIG. 7 depicts a simplified block diagram of a node implementing the disclosed DMA memory buffer management system;

FIG. 8 depicts a sequence of steps that a node utilizing the disclosed DMA memory buffer management system will execute to read data from storage;

FIG. 9 depicts a prior art data structure for data passed by network or stored on disk;

FIG. 10 depicts a disclosed data structure for data passed by network or stored on disk;

FIG. 11 depicts a simplified class abstraction for a drive access class; and

FIG. 12 depicts a simplified software block diagram of the disclosed DBMS system implementing the disclosed drive access class.

DETAILED DESCRIPTION

This application discloses a number of infrastructure improvements for use in a database system, along with a DBMS utilizing those improvements. The infrastructure is adapted to allow the database system to scale to a size of an Exabyte or even larger, and to allow each record stored in the database to be accessed quickly and numerous times per day. Such a database will be useful for many tasks; for example, every bit of data exchanged over a company's network can be stored in such a database and analyzed to determine the mechanism of a network intrusion after it has been discovered.

One issue that such a database must overcome is data access latency. There are numerous sources of data access latency, including, for example, network accesses, disk accesses, and memory copies. These latencies are exacerbated in distributed systems.

Turning to the Figures, and to FIG. 4 in particular, an embodiment of a network architecture 100 for supporting a high performance database system is disclosed. The network architecture 100 comprises three storage clusters: a blazing storage cluster 105, a hot storage cluster 115, and a warm storage cluster 125. In addition, the network architecture comprises an index cluster 135. Each cluster comprises multiple nodes. For example, the blazing storage cluster 105 is depicted as comprising five blazing storage nodes 101; the hot storage cluster 115 is depicted as comprising five hot storage nodes 111; and the warm storage cluster 125 is depicted as comprising five warm storage nodes 121. In addition, the index cluster 135 is depicted as comprising three index nodes 131. While the clusters 105, 115, 125, 135 are each depicted with a specific number of nodes, it should be noted that each cluster 105, 115, 125, 135 can comprise an arbitrary number of nodes.

A blazing storage node 101 may include, for example, an array of NVDIMM (Non-Volatile Dual Inline Memory Module) storage (a type of NVRAM), such as that marketed by Hewlett Packard Enterprise, or any other extremely fast storage, along with appropriate controllers to allow for full speed access to such storage. For example, DRAM with a write ahead log implemented on an NVMe drive could be utilized to implement blazing storage. Specifically, write-ahead logging is used to log updates to an in-memory (DRAM) data structure. As log entries are appended to the end of the log, they are flushed to an NVMe drive when the size of the in-memory log entries nears or exceeds the size of a Solid State Memory page, or after a configured timeout threshold, such as ten seconds, has been reached. This guarantees an upper bound on the amount of data that is lost if a power outage or system crash should occur; it also allows the entire in-memory structure to be rebuilt from the disk log after a restart. Since the log is written sequentially to the SSD, write amplification on the Solid State Drive is minimized.
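The flush policy just described can be sketched as follows. The page size, the ten second timeout, and all names here are assumptions for illustration; the flush body is a stub standing in for a sequential append to the NVMe-backed log.

    // Sketch (C++): flush buffered write-ahead-log entries when they near
    // one SSD page in size, or after a configured timeout.
    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    class WriteAheadLog {
    public:
        void append(const std::vector<std::uint8_t>& entry) {
            buffered_.insert(buffered_.end(), entry.begin(), entry.end());
            maybe_flush();
        }

    private:
        static constexpr std::size_t kSsdPageSize = 4096;    // assumed
        static constexpr std::chrono::seconds kTimeout{10};  // assumed

        void maybe_flush() {
            // A production system would also arm a timer so the timeout
            // fires without a new append; checking on append keeps this short.
            const auto now = std::chrono::steady_clock::now();
            if (buffered_.size() >= kSsdPageSize ||
                now - last_flush_ >= kTimeout) {
                flush_to_nvme();
                buffered_.clear();
                last_flush_ = now;
            }
        }

        // Stub: would append buffered_ sequentially to the NVMe drive,
        // which minimizes write amplification as described above.
        void flush_to_nvme() {}

        std::vector<std::uint8_t> buffered_;
        std::chrono::steady_clock::time_point last_flush_ =
            std::chrono::steady_clock::now();
    };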

In addition, NVRAM technology can also be utilized to implement blazing storage. In such a case, the in-memory structures would be preserved in the event of a power outage. A hot storage node 111 may include, for example, one or more Solid State NVME drives, along with appropriate controllers to allow for full speed access to such storage. A warm storage node 121 may include, for example, one or more Solid State SATA drives, along with appropriate controllers to allow for full speed access to such storage.

Each index node 131 will also include storage, which will generally comprise high performance storage such as Solid State SATA drives or higher performance storage devices. Generally, the index nodes 131 will store the relational database structure, which may comprise, for example, a collection of tables and search keys.

To allow information to be exchanged as quickly as possible, certain of the clusters are connected via high speed, RDMA capable links 108. In particular, the index cluster 135 is connected to the storage clusters 105, 115, 125 by high speed, RDMA capable links 108. On the other hand, the storage clusters 105, 115, 125 are connected to one another by standard (non-RDMA capable) high performance network links 109, such as 10 Gbps Ethernet.

As discussed above, Infiniband is an example of a high speed, RDMA capable link. Importantly, such links allow different nodes in each cluster to exchange information rapidly; as discussed above, information from one node is inserted into the memory of another node without consuming processor cycles of the target node. The blazing storage cluster 105 also comprises a high speed switch 103. Each blazing storage node 101 is operatively coupled to the high speed switch 103 through a high speed, RDMA capable link 108. Similarly, each hot storage node 111 is coupled to a high speed switch 113 through a high speed, RDMA capable link 108, and each warm storage node 121 is coupled to the high speed switch 123 through a high speed, RDMA capable link 108. Similarly, the high speed switches 103, 113, 123 coupled to each storage cluster 105, 115, 125 are each coupled to the high speed switch 133 of the index cluster 135 by a high speed, RDMA capable link 108.

Turning to FIG. 5, the specific RDMA memory infrastructure used by the disclosed database infrastructure is depicted. In particular, a data storage node 200 includes an RDMA capable network interface card 206, which communicates with other devices over an RDMA capable network 210. The network interface card 206 receives data directly into a pinned memory buffer 202. A data store application 204 directly accesses and manages the pinned memory buffer 202; in particular, the data store application 204 will utilize a management algorithm and structure, such as a queue 220, to manage data.

Similarly, an index node 300 includes an RDMA capable network interface card 306, which communicates with other devices over an RDMA capable network 210. The network interface card 306 receives data directly into a pinned memory buffer 302. An index app 304 directly accesses the pinned memory buffer 302, which can be managed as a queue 320.

Each entry in the queues 220, 320 will be a fixed memory size, such as, for example, 4 kilobytes. In a preferred embodiment of the disclosed DBMS and associated infrastructure, the size of a queue entry will be set to the maximal size of any network message expected to be passed between nodes. As data is received by the node (data or index), the corresponding application (database or index) directly operates on the information without copying it. As data is used and no longer needed, a queue pointer is advanced, so that the no longer needed data can be overwritten. Accordingly, the application directly manages the RDMA buffer. This is in contrast to prior art systems, where the RDMA buffers are managed by a driver and accessed through the operating system.

The steps by which network data is received by the disclosed DBMS system are illustrated in FIG. 6. In step 331, the application 204 coordinates with a NIC 206 to allocate a pinned memory buffer 202. Next, in step 332, the application 204 organizes the pinned buffer 202 into a managed queue 220 comprising multiple fixed size buffers. In step 333, the NIC 206 receives data from the network (not shown), resulting in a filled memory buffer 220A. In step 334, the application 204 operates directly on the received data within the filled memory buffer 220A; i.e., directly within the pinned memory buffer 202. In step 335, the application 204 manages the queue by advancing a receive pointer to, for example, empty buffer 220B. Accordingly, when data is received by the NIC 206, it will be written to the empty buffer 220B. In step 336, the application 204 recycles a no longer needed buffer, such as the buffer 220A when the application is done with it, by, for example, placing it at the tail of the queue, resulting in a recycled buffer 220C.
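A sketch of this queue discipline follows, assuming a single producer (the NIC) and a single consumer (the application); the names and the fixed ring representation are invented for illustration, and the pinned RDMA region is stood in for by an ordinary array.

    // Sketch (C++): ring of fixed size entries over a (notionally pinned)
    // buffer; the application operates on data in place and recycles
    // entries by advancing pointers, so no copies are made.
    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kEntrySize  = 4096; // e.g., maximal network message
    constexpr std::size_t kNumEntries = 8;

    struct Entry { alignas(64) std::uint8_t data[kEntrySize]; };

    struct ReceiveQueue {
        Entry entries[kNumEntries];  // stand-in for pinned RDMA memory
        std::size_t filled = 0;      // next slot the NIC fills (step 333)
        std::size_t consumed = 0;    // next filled slot the app reads (step 334)

        // NIC side: next empty entry to receive into, or nullptr if full.
        Entry* next_empty() {
            if (filled - consumed == kNumEntries) return nullptr;
            return &entries[filled % kNumEntries];
        }
        void mark_filled() { ++filled; }

        // Application side: operate on the entry in place, then advance the
        // receive pointer (steps 335-336), recycling the slot for the NIC.
        Entry* next_filled() {
            if (filled == consumed) return nullptr;
            return &entries[consumed % kNumEntries];
        }
        void recycle() { ++consumed; }
    };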

Turning to FIG. 7, a specific disk buffer management scheme for use with the disclosed DBMS is depicted. In particular, a data storage node 200 comprises a disk controller 406. The disk controller 406, which can be incorporated into a microprocessor or chipset on the motherboard (not shown) of the data storage node 200, or can be an independent card, is coupled to a storage device 410, which can be a single drive (of any type), a RAID array, a storage area network, or any other configuration of storage. The disk controller 406 accesses a pinned memory buffer 402 for DMA memory access for reads. A computer software application 404 (which could be a data store application, an index application, or another type of application) directly accesses and manages the pinned memory buffer 402. In particular, the application 404 manages the pinned memory buffer 402 as a queue 420.

Each entry in the queue 420 will be a fixed memory size, such as, for example, 4 kilobytes. In a preferred embodiment of the disclosed DBMS and associated infrastructure, the size of a queue entry will be set to the same as, or an integer multiple of, the page size of the storage drives used by a particular node. As data is read from storage 410, the application 404 directly operates on the data without copying it. As data is used and no longer needed, a queue pointer is advanced, so that the no longer needed data can be overwritten. Accordingly, the application directly manages the DMA buffer. This is in contrast to prior art DBMS systems, where DMA buffers are managed by a driver and accessed through the operating system.

The steps by which data is read by a node from storage into memory for the disclosed DBMS system are set forth in FIG. 8. In step 431, an application 404 allocates a pinned memory buffer 402. Next, in step 432, the application 404 organizes the pinned buffer 402 into a managed queue 420 comprising multiple fixed size buffers. In step 433, the disk controller 406 reads data from storage (not shown), resulting in a filled memory buffer 420A. In step 434, the application 404 operates directly on the read data within the filled memory buffer 420A; i.e., directly within the pinned memory buffer 402. For example, the application 404 could calculate an arithmetic mean of the data in the pinned memory buffer, or perform some other operation. In step 435, the application 404 manages the queue by advancing a read pointer to, for example, empty buffer 420B. Accordingly, when data is next read by the disk controller 406, it will be written to the empty buffer 420B. In step 436, the application 404 recycles a no longer needed buffer, such as the buffer 420A when the application is done with it, by, for example, placing it at the tail of the queue, resulting in a recycled buffer 420C.
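As one concrete illustration of step 434, the fragment below computes an arithmetic mean directly over a filled entry, with no copy into an application-owned buffer first. The entry size (shown as an integer multiple of an assumed drive page size, per the preferred embodiment above) and the treatment of the entry as packed 64-bit values are both assumptions.

    // Sketch (C++): operate in place on a filled, pinned queue entry.
    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kPageSize  = 4096;          // assumed drive page size
    constexpr std::size_t kEntrySize = 2 * kPageSize; // integer multiple of it

    double entry_mean(const std::uint64_t* entry_words) {
        constexpr std::size_t n = kEntrySize / sizeof(std::uint64_t);
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            sum += static_cast<double>(entry_words[i]);
        return sum / static_cast<double>(n);
    }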

While direct management of RDMA buffers by the applications of the disclosed DBMS system will substantially improve performance versus prior art implementations, additional improvements can still be made. In particular, prior art systems typically maintain data in a packed, or compressed, format. For example, to conserve disk storage and network bandwidth, most systems will pack Boolean type data into a single bit. In addition, many systems will actually apply simple compression before transmitting data via a network or committing it to disk storage. Prior art systems do this to ensure that network bandwidth and persistent storage, both of which are scarce resources, are used efficiently.

An example of a prior art data structure 500 is shown in FIG. 9. As depicted, each row comprises 64 bits, which assumes that the underlying processor is a 64 bit processor. It should be noted that this choice was made solely for purposes of illustration, and not to limit the application of this disclosure to 64 bit processors. The first row of the depicted prior art data structure 500 comprises a header 501, and the second row comprises two longword entries 502, 503, where a longword data entry is understood to be 32 bits. The third row comprises two word entries 504, 505, where a word entry is understood to be 16 bits, and a longword entry 506. The fourth row comprises eight byte entries 507, 508, 509, 510, 511, 512, 513, 514, and the fifth row comprises sixty-four single-bit entries 515.

The prior art data structure 500 efficiently stores 78 data entries in a mere 320 bits; in fact, not a single bit in the prior art data structure 500 is unused. However, each piece of data in the prior art data structure, with the exception of the header, will require multiple operations prior to being used. For example, longword 503 must be copied to a separate location, and the upper 32 bits masked off, prior to being copied into a register and operated on. Each of these operations will use precious processor time in the interest of conserving memory, network bandwidth, and disk storage.

The disclosed DBMS system, however, is not optimized to minimize network usage or disk storage. Rather, the disclosed DBMS system is optimized to maximize performance. Accordingly, each data entry is stored in a 64 bit memory location, as depicted in a simplified fashion in FIG. 10. This allows data to be operated on immediately by the processor without any additional operations. While substantially more memory will be used (4992 bits instead of the 320 bits that the prior art data structure utilizes to store the same data), many fewer processor cycles will be required to operate on the same data, as all of it is properly aligned to be operated on by the processor. In particular, each data field will be copied directly to a processor register, operated on, and copied back; no format translation is required.
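The contrast can be made concrete in code. The two layouts below use invented field names and simply mirror the rows of FIGS. 9 and 10: the packed form stores the 78 entries in 320 bits but needs shift-and-mask work before use, while the register-width form spends 4992 bits so that every field can move directly between memory and a 64 bit register.

    // Sketch (C++): packed layout (FIG. 9) versus register-width layout
    // (FIG. 10). Field names are illustrative only.
    #include <cstdint>

    struct PackedRow {                 // 320 bits total, no bit unused
        std::uint64_t header;          // row 1: header 501
        std::uint32_t long_a, long_b;  // row 2: longwords 502, 503
        std::uint16_t word_a, word_b;  // row 3: words 504, 505 ...
        std::uint32_t long_c;          //        ... and longword 506
        std::uint8_t  bytes[8];        // row 4: byte entries 507-514
        std::uint64_t flags;           // row 5: sixty-four 1-bit entries 515
    };

    // Reading one packed flag costs a shift and a mask before the value
    // is usable in a register.
    inline bool packed_flag(const PackedRow& r, unsigned i) {
        return (r.flags >> i) & 1u;
    }

    struct AlignedRow {                // 78 fields x 64 bits = 4992 bits
        std::uint64_t header;
        std::uint64_t long_a, long_b;
        std::uint64_t word_a, word_b;
        std::uint64_t long_c;
        std::uint64_t bytes[8];
        std::uint64_t flags[64];       // one full word per former 1-bit entry
    };

    // Each aligned field is used directly; no format translation needed.
    inline std::uint64_t aligned_flag(const AlignedRow& r, unsigned i) {
        return r.flags[i];
    }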

Another way in which the disclosed DBMS system optimizes performance is by the various DBMS applications directly accessing NVME, NVRAM, and SATA drives. This is done through an abstraction layer, which is generally illustrated in FIG. 11. In particular, a drive access class 602 provides storage access services to an application (not shown) into which it is incorporated or to which it is accessible. The drive access class 602 includes an NVME drive access class 604, a SATA drive access class 606, and an NVRAM drive access class 608.

FIG. 12 further describes the disclosed drive abstraction and its novel interface with the disclosed DBMS system. In particular, a pinned memory buffer 702 is directly accessed by a DBMS application 704, which could be a database application, an index store application, or another type of application. The application incorporates the drive access class 602. When the application 704 seeks to access an NVME drive 710, the drive access class 602 utilizes the NVME drive access class 604, which directly accesses an NVME controller 706. The NVME controller 706 is coupled to the NVME drive 710, which it controls. On the other hand, when the application 704 seeks to access a SATA drive 712, the drive access class 602 utilizes the SATA drive access class 606, which directly accesses a SATA controller 708. The SATA controller 708 is coupled to the SATA drive 712, which it controls. And, when the application 704 seeks to access an NVRAM drive 714, the drive access class 602 utilizes the NVRAM drive access class 608, which directly accesses an NVRAM controller 709.
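One natural realization of FIGS. 11 and 12 is a small virtual-dispatch hierarchy, sketched below. All names are invented, the method bodies are stubs, and the selection logic is only an assumption about how the abstraction might route requests; a real subclass would program its controller directly (for example, by building NVME submission queue entries in user space) rather than calling into the operating system.

    // Sketch (C++): drive access class 602 with subclasses 604/606/608.
    #include <cstddef>
    #include <cstdint>
    #include <memory>

    class DriveAccess {                       // drive access class 602
    public:
        virtual ~DriveAccess() = default;
        // Read/write between a pinned buffer and the drive, bypassing the OS.
        virtual bool read(std::uint64_t off, std::uint8_t* pinned,
                          std::size_t n) = 0;
        virtual bool write(std::uint64_t off, const std::uint8_t* pinned,
                           std::size_t n) = 0;
    };

    class NvmeDriveAccess : public DriveAccess {   // 604: NVME controller 706
        bool read(std::uint64_t, std::uint8_t*, std::size_t) override
            { return true; }
        bool write(std::uint64_t, const std::uint8_t*, std::size_t) override
            { return true; }
    };
    class SataDriveAccess : public DriveAccess {   // 606: SATA controller 708
        bool read(std::uint64_t, std::uint8_t*, std::size_t) override
            { return true; }
        bool write(std::uint64_t, const std::uint8_t*, std::size_t) override
            { return true; }
    };
    class NvramDriveAccess : public DriveAccess {  // 608: NVRAM controller 709
        bool read(std::uint64_t, std::uint8_t*, std::size_t) override
            { return true; }
        bool write(std::uint64_t, const std::uint8_t*, std::size_t) override
            { return true; }
    };

    enum class DriveKind { Nvme, Sata, Nvram };

    // The drive access class routes a request to the subclass matching the
    // drive the application seeks to access.
    std::unique_ptr<DriveAccess> make_drive_access(DriveKind kind) {
        switch (kind) {
            case DriveKind::Nvme:  return std::make_unique<NvmeDriveAccess>();
            case DriveKind::Sata:  return std::make_unique<SataDriveAccess>();
            case DriveKind::Nvram: return std::make_unique<NvramDriveAccess>();
        }
        return nullptr;
    }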

Storage access services include opening a file, reading a file, or writing a file. Generally, such services are provided through an operating system, which utilizes a device specific driver to interface with the particular device. Operating system calls are time consuming, and thereby decrease the performance of DBMS systems that utilize them. Rather than suffering such performance penalties, the disclosed DBMS system directly accesses NVRAM, NVME, and SATA controllers, as disclosed herein.

The foregoing description of the disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or to limit the disclosure to the precise form disclosed. The description was selected to best explain the principles of the present teachings and the practical application of these principles to enable others skilled in the art to best utilize the disclosure in various embodiments and with various modifications as are suited to the particular use contemplated. It is intended that the scope of the disclosure not be limited by the specification, but be defined by the claims set forth below. In addition, although narrow claims may be presented below, it should be recognized that the scope of this invention is much broader than presented by the claim(s). It is intended that broader claims will be submitted in one or more applications that claim the benefit of priority from this application. Insofar as the description above and the accompanying drawings disclose additional subject matter that is not within the scope of the claim or claims below, the additional inventions are not dedicated to the public and the right to file one or more applications to claim such additional inventions is reserved.

What is claimed is:
1. A node in a database management system, the node comprising: a) a drive controller; b) an application including a drive access class; c) a plurality of pinned, fixed-size memory buffers allocated by the application; d) a drive; e) wherein the application establishes a queue of the pinned, fixed-size memory buffers; f) wherein the application dequeues a single entry from the queue; g) wherein the application directly accesses the drive controller using the drive access class; and h) wherein the application reads data from the drive into the single entry using the drive access class.
2. The node of claim 1 wherein the drive controller is a SATA drive controller, the drive access class is a SATA drive access class, and the drive is a SATA drive.
3. The node of claim 1 wherein the drive controller is a NVME drive controller, the drive access class is a NVME drive access class, and the drive is a NVME drive.
4. The node of claim 1 wherein the pinned memory buffers are each an RDMA memory buffer.
5. The node of claim 1 further comprising a network interface controller, and wherein the network interface controller is adapted to receive data into one of the fixed size entries, creating a filled data entry.
6. The node of claim 5 wherein the application writes the data from the filled data entry to the drive using the drive access class.
7. A node in a database management system, the node comprising: a) a drive controller; b) an application including a drive access class; c) a plurality of pinned, fixed-size memory buffers allocated by the application; d) a drive; e) wherein the application establishes a queue of the pinned, fixed-size memory buffers; f) wherein the application dequeues a single entry from the queue; g) wherein the application directly accesses the drive controller using the drive access class; and h) wherein the application fills the single entry with data and writes the single entry to the drive using the drive access class.
8. The node of claim 7 wherein the drive controller is a SATA drive controller, the drive access class is a SATA drive access class, and the drive is a SATA drive.
9. The node of claim 7 wherein the drive controller is a NVME drive controller, the drive access class is a NVME drive access class, and the drive is a NVME drive.